LibEvolutionEval: A Benchmark and Study for Version-Specific Code Generation

Oral presentation at NAACL 2025

Sachit Kuhar¹, Wasi Uddin Ahmad²*, Zijian Wang¹, Nihal Jain¹, Haifeng Qian²*, Baishakhi Ray¹, Murali Krishna Ramanathan¹, Xiaofei Ma¹, Anoop Deoras¹
¹AWS AI Labs, ²NVIDIA

* Work done at AWS AI Labs

TL;DR


Recent advancements in code completion models have often overlooked the evolving nature of public libraries. LIBEVOLUTIONEVAL bridges this gap with a benchmark that rigorously tests large language models on version-specific code completion across public libraries such as PyTorch, Matplotlib, and SciPy. Our experiments show that model performance varies significantly as libraries undergo rapid version changes, with APIs being introduced, deprecated, or modified. We demonstrate that providing version-aware context, such as retrieved documentation, helps but does not fully solve the inherent challenges. We also show that embedding models themselves exhibit bias toward certain library versions, highlighting the complexity of handling evolving public libraries in real-world development settings.

[LIBEVOLUTIONEVAL Teaser Image]

Abstract

Recent advancements in code completion models have primarily focused on either single-file contexts or entire repositories, overlooking the critical challenge of fast-evolving public libraries. As libraries introduce, deprecate, or modify APIs across versions, a model’s performance can vary significantly based on the year and version of the library in use.

We propose LIBEVOLUTIONEVAL, a benchmark and detailed study that explicitly evaluates code language models under version-specific code completion tasks. Spanning multiple years and covering popular Python libraries, it provides both real-world GitHub-based examples and controlled, documentation-based tasks. Our experiments with popular code LLMs and embedding models reveal that addressing library evolution demands version-aware context. While retrieval of version-specific documentation can improve code completions, it does not fully resolve version-dependent biases in the models, indicating the need for specialized training techniques to handle rapid library changes.

We hope that LIBEVOLUTIONEVAL inspires further research that incorporates temporal and version-specific knowledge into code completion models, enabling them to reflect real-world software development more accurately.

Dataset Construction

We gather real-world code usage from permissively licensed GitHub repositories and map each snippet to its corresponding library version via requirements.txt and other metadata. To enable controlled experimentation, we also generate synthetic tasks derived from official library documentation.
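
For illustration, here is a minimal sketch of how a pinned library version could be recovered from a requirements.txt file using the packaging library; the benchmark's actual extraction pipeline also consults other repository metadata, and the helper name below is our own.

    from packaging.requirements import Requirement

    def pinned_versions(requirements_text: str) -> dict:
        """Map package name -> version specifier parsed from a requirements.txt body."""
        versions = {}
        for raw in requirements_text.splitlines():
            line = raw.split("#", 1)[0].strip()      # drop inline comments
            if not line or line.startswith("-"):     # skip pip flags such as -r / --index-url
                continue
            try:
                req = Requirement(line)
            except Exception:                        # not a plain requirement (e.g. a local path)
                continue
            versions[req.name.lower()] = str(req.specifier)  # e.g. "==1.13.1", or "" if unpinned
        return versions

    print(pinned_versions("torch==1.13.1\nmatplotlib>=3.5,<3.7\nscipy"))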

Each code completion prompt is annotated with the library version, focusing on introduced, deprecated, modified, or unchanged APIs. This annotation strategy allows us to assess how models handle evolving libraries in detail.
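
Concretely, an annotated evaluation example has roughly the following shape; the field names here are illustrative placeholders, not the benchmark's released schema.

    example = {
        "library": "matplotlib",
        "version": "3.7.1",            # version resolved from repository metadata
        "api_status": "unchanged",     # one of: introduced / deprecated / modified / unchanged
        "prompt": "import matplotlib.pyplot as plt\nfig, ax = plt.subplots()\nax.",
        "groundtruth": "plot(x, y)",   # expected completion at the cursor
    }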

[Dataset Construction Process]


Dataset Statistics

Feature                 | Assorted | PyTorch | Matplotlib
# API Documentations    | -        | 29.4K   | 35.6K
# Eval Examples         | 4.5K     | 20.1K   | 10.1K
Avg. # lines in prompt  | 66.25    | 104.91  | 84.36
Avg. # tokens in prompt | 732.06   | 1149.34 | 995.91

Table 2: LIBEVOLUTIONEVAL statistics

Main Results

The figure below illustrates the code completion performance (F1 score) of the StarCoder2, Mistral, and GPT-4o-Mini models. As libraries evolve, these code LLMs show significant variation in their ability to complete code correctly, indicating the need for version-aware context.

[Polar plot showing F1 scores]
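
The reported F1 compares a model's completion against the ground-truth code. As a reference point, here is a minimal sketch of a token-level F1, assuming simple whitespace tokenization; the paper's exact tokenizer and matching rules may differ.

    from collections import Counter

    def token_f1(prediction: str, reference: str) -> float:
        """Harmonic mean of token-level precision and recall against the ground truth."""
        pred, ref = prediction.split(), reference.split()
        if not pred or not ref:
            return float(pred == ref)
        overlap = sum((Counter(pred) & Counter(ref)).values())
        if overlap == 0:
            return 0.0
        precision, recall = overlap / len(pred), overlap / len(ref)
        return 2 * precision * recall / (precision + recall)

    print(token_f1("ax.plot(x, y, label='loss')", "ax.plot(x, y, label='loss')"))  # 1.0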

After examining overall trends in the polar plot, we present a more detailed breakdown of results (see the table below), focusing on two representative libraries: PyTorch and Matplotlib. We evaluate three prompting strategies: In-File (not version-aware), Version-Aware, and Version-Aware RAG.

Model         | Completion Strategy            | Context Setting             | PyTorch (F1) | Matplotlib (F1)
StarCoder2-7B | Fill-in-the-Middle             | In-File (Not Version-Aware) | 68.8         | 69.7
              |                                | + Version-Aware             | 69.3         | 70.1
              |                                | + Version-Aware RAG         | 73.3         | 75.4
Mistral-7B    | Left-Context Only              | In-File (Not Version-Aware) | 65.8         | 60.18
              |                                | + Version-Aware             | 66.04        | 61.2
              |                                | + Version-Aware RAG         | 67.6         | 69.05
GPT-4o-Mini   | Instruction-based (w/ Example) | In-File (Not Version-Aware) | 64.3         | 52.5
              |                                | + Version-Aware             | 64.78        | 53.1
              |                                | + Version-Aware RAG         | 70.14        | 66.7

Across all three models, adding version information to the prompt yields modest gains, while retrieving version-specific documentation (Version-Aware RAG) brings the largest improvements on both libraries.
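
As a rough illustration of the three settings, the sketch below assembles an In-File prompt, a Version-Aware prompt, and a Version-Aware RAG prompt. The template wording and the build_prompt helper are our own assumptions, not the exact prompts used in the paper.

    def build_prompt(left_context, library=None, version=None, retrieved_docs=None):
        """Assemble a completion prompt under the three context settings."""
        parts = []
        if retrieved_docs:                  # Version-Aware RAG: retrieved version-specific docs
            parts.append("# Retrieved API documentation:\n" + "\n".join(retrieved_docs))
        if library and version:             # Version-Aware: state the target library version
            parts.append(f"# This project depends on {library}=={version}")
        parts.append(left_context)          # In-File: only the code preceding the cursor
        return "\n\n".join(parts)

    code = "import torch\nx = torch.randn(3, 3)\nvals, vecs = "
    in_file       = build_prompt(code)
    version_aware = build_prompt(code, library="torch", version="1.13.1")
    rag           = build_prompt(code, library="torch", version="1.13.1",
                                 retrieved_docs=["torch.linalg.eigh(A): eigendecomposition of a Hermitian matrix."])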

Analysis

We present a collection of sub-figures that examine key aspects of model performance, covering scaling effects, in-file vs. version-aware retrieval, direct vs. indirect completions, and more.

[CodeSage Scaling]
MRR Scores for CodeSage Small and Large Models
[StarCoder Scaling]
F1 Scores for StarCoder2 and StarCoder Models
[In-File vs RAG]
In-File vs Version-Aware RAG Performance
[Direct vs Indirect Completions]
Direct vs Indirect Code Completion
[Overall vs Deprecated]
Overall vs Deprecated Set Code Completion
[MRR Scores CodeSage/OpenAI Ada]
MRR Scores for CodeSage / OpenAI Ada
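
Several of the panels above report Mean Reciprocal Rank (MRR) for the embedding models used as retrievers. For reference, here is a minimal MRR computation, assuming each query has exactly one gold documentation entry.

    def mean_reciprocal_rank(rankings, gold_ids):
        """Average of 1/rank of the gold documentation entry (0 when it is not retrieved)."""
        total = 0.0
        for ranking, gold in zip(rankings, gold_ids):
            if gold in ranking:
                total += 1.0 / (ranking.index(gold) + 1)
        return total / len(gold_ids)

    # Gold doc ranked 1st for the first query and 3rd for the second -> (1 + 1/3) / 2 ≈ 0.67
    print(mean_reciprocal_rank([["d1", "d2"], ["d5", "d4", "d3"]], ["d1", "d3"]))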

Temporal Analysis

In addition to overall performance, we examine how models handle rapidly changing libraries over time. The three tables below cover Matplotlib (deprecated and introduced APIs) and PyTorch (introduced or deprecated APIs) across different version years:

StarCoder2 (Model Release: 2024) on Matplotlib Deprecated APIs

Version Year | Deprecated API Score | Overall Score
2019         | 38.58                | 61.83
2020         | 43.93                | 60.12
2021         | 53.01                | 61.31
2022         | 52.74                | 65.15
2023         | 57.14                | 66.08

CodeGen 1.0 (Knowledge Cutoff: 2022) on Matplotlib Introduced APIs

Version Year | Introduced API Score | Overall Score
2020         | 53.14                | 62.89
2021         | 54.23                | 62.85
2022         | 56.29                | 60.04
2023         | 44.08                | 59.44
2024         | 41.37                | 58.57

StarCoder2 (Knowledge Cutoff: 2024) on PyTorch Introduced/Deprecated APIs

Version Year | Deprec./Intro. API Score | Overall Score
2018         | 68.78                    | 71.06
2020         | 59.10                    | 75.45
2021         | 68.13                    | 72.84
2022         | 60.93                    | 71.02
2023         | 67.13                    | 72.82
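
To make the version dependence concrete, consider an illustrative PyTorch example that is not drawn from the benchmark: eigendecomposition of a symmetric matrix moved from torch.symeig, which was deprecated and later removed, to torch.linalg.eigh, introduced around PyTorch 1.8 (2021). A completion that is correct for one version year can therefore be wrong for another.

    import torch

    A = torch.randn(4, 4)
    A = A @ A.T                     # build a symmetric matrix

    # Completion a model trained mostly on pre-2021 code might produce; torch.symeig
    # was deprecated and later removed, so this fails on recent PyTorch versions:
    # eigvals, eigvecs = torch.symeig(A, eigenvectors=True)

    # Completion that is correct for recent versions (torch.linalg.eigh):
    eigvals, eigvecs = torch.linalg.eigh(A)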

BibTeX

@inproceedings{kuhar-etal-2025-libevolutioneval,
    title = "{L}ib{E}volution{E}val: A Benchmark and Study for Version-Specific Code Generation",
    author = "Kuhar, Sachit  and
      Ahmad, Wasi Uddin  and
      Wang, Zijian  and
      Jain, Nihal  and
      Qian, Haifeng  and
      Ray, Baishakhi  and
      Ramanathan, Murali Krishna  and
      Ma, Xiaofei  and
      Deoras, Anoop",
    editor = "Chiruzzo, Luis  and
      Ritter, Alan  and
      Wang, Lu",
    booktitle = "Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)",
    month = apr,
    year = "2025",
    address = "Albuquerque, New Mexico",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.naacl-long.348/",
    pages = "6826--6840",
    ISBN = "979-8-89176-189-6",
    abstract = "Recent advancements in code completion models have primarily focused on local file contexts. However, these studies do not fully capture the complexity of real-world software development, which often requires the use of rapidly-evolving public libraries. To address this gap, we introduce LibEvolutionEval, a comprehensive study that emphasizes the need to understand library evolution to perform accurate in-line code completions. LibEvolutionEvaloffers a version-specific code-completion task across eight libraries as they evolve over the years, along with an in-depth analysis of the evolution of two widely used and well-maintained public libraries: PyTorch and Matplotlib. We evaluate several popular models and find that public library evolution significantly affects their performance. To mitigate this, we explored how retrieving version-specific library documentation and prompt-based techniques can enhance model capability in dealing with these fast-evolving packages. This suggests a promising path forward for better handling fast-evolving libraries. Our tasks will be made publicly available upon acceptance."
}